Agent coupling device, method, and program

ABSTRACT

It is possible to construct an agent that can deal with even a complicated task. For a value function for obtaining a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks, an overall value function is obtained, which is a weighting sum of a plurality of part value functions learned in advance to obtain a policy for an action of a part agent that solves the part tasks for each of the plurality of part tasks using a weight for each of the plurality of part tasks. The action of the agent corresponding to the overall task is determined using a policy obtained from the overall value function and the agent is caused to act.

TECHNICAL FIELD

The present invention relates to an agent coupling device, a method and a program, and more particularly, to an agent coupling device, a method and a program for solving a task.

BACKGROUND ART

With the breakthrough of deep learning, AI (artificial intelligence) technologies are attracting great attention. Above all, deep reinforcement learning, in which deep learning is combined with a learning framework called "reinforcement learning" that performs autonomous trial and error, has achieved great results in the field of game AI (computer games, igo (a board game of capturing territory) and the like) (see Non-Patent Literature 1). In recent years, application of deep reinforcement learning to robot control, drone control, adaptive control of traffic signals (see Non-Patent Literature 2) and the like is being promoted.

CITATION LIST Non-Patent Literature

Non-Patent Literature 1: Human-level control through deep reinforcement learning, Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K. and Ostrovski, Georg and others, Nature, 2015.

Non-Patent Literature 2: Using a deep reinforcement learning agent for traffic signal control, Genders, Wade and Razavi, Saiedeh, arXiv preprint arXiv:1611.01142, 2016.

Non-Patent Literature 3: Reinforcement Learning with Deep Energy-Based Policies, Haarnoja, Tuomas and Tang, Haoran and Abbeel, Pieter and Levine, Sergey, ICML, 2017.

Non-Patent Literature 4: Composable Deep Reinforcement Learning for Robotic Manipulation, Haarnoja, Tuomas and Pong, Vitchyr and Zhou, Aurick and Dalal, Murtaza and Abbeel, Pieter and Levine, Sergey, arXiv preprint arXiv:1803.06773, 2018.

Non-Patent Literature 5: Distilling the knowledge in a neural network, Hinton, Geoffrey and Vinyals, Oriol and Dean, Jeff, arXiv preprint arXiv:1503.02531, 2015.

SUMMARY OF THE INVENTION Technical Problem

However, deep reinforcement learning has the following two weak points.

One is that deep reinforcement learning requires trial and error by an action subject (e.g., robot) called an "agent," which generally takes a long learning time.

The other is that since a learning result of reinforcement learning depends on a given environment (task), if the environment changes, learning needs to be (basically) redone from zero.

Therefore, even if a task seems similar in the eyes of humans, the task needs to be relearned every time the environment changes, requiring a lot of effort (manpower cost, calculation cost).

Bearing the aforementioned problem in mind, an approach is under study in which a task to be a base and an agent that solves the task (called a "part task" and a "part agent," respectively) are learned in advance, and an agent that solves a complicated overall task is created (constituted) by combining the part agents and the part tasks (see Non-Patent Literatures 3 and 4). However, since such an existing technique considers only a case where a task represented by a simple average is constructed using a simple average of the part agents, the number of applicable scenes is limited.

An object of the present invention, which has been made in view of the above circumstances, is to provide an agent coupling device, a method and a program capable of constructing an agent that can deal with even a complicated task.

Means for Solving the Problem

In order to attain the above described object, an agent coupling device according to a first invention is configured by including an agent coupling unit that obtains an overall value function with respect to a value function for obtaining a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks, the overall value function being a weighting sum of a plurality of part value functions learned in advance to obtain a policy for an action of a part agent that solves the part tasks for each of the plurality of part tasks using a weight for each of the plurality of part tasks, and an execution unit that determines the action of the agent corresponding to the overall task using the policy obtained from the overall value function and causes the agent to act.

In the agent coupling device according to the first invention, the agent coupling unit may obtain, as a neural network that approximates the overall value function, a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks for a neural network learned in advance so as to approximate the part value function for each of the plurality of part tasks, and the execution unit may determine an action of an agent for the overall task using a policy obtained from the neural network that approximates the overall value function and cause the agent to act.

The agent coupling device according to the first invention may further include a relearning unit that relearns a neural network that approximates the overall value function based on an action result of the agent by the execution unit.

In the agent coupling device according to the first invention, the agent coupling unit may obtain, for each of the plurality of part tasks, a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks for a neural network learned in advance so as to approximate the part value function, as a neural network that approximates the overall value function, and create a neural network having a predetermined structure corresponding to the neural network that approximates the overall value function, and the execution unit may determine the action of the agent for the overall task using the policy obtained from the neural network having the predetermined structure and cause the agent to act.

The agent coupling device according to the first invention may further include a relearning unit that relearns the neural network having the predetermined structure based on the action result of the agent by the execution unit.

An agent coupling method according to a second invention includes a step of obtaining an overall value function with respect to a value function for obtaining a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks, the overall value function being a weighting sum of a plurality of part value functions learned in advance to obtain a policy for an action of a part agent that solves the part tasks for each of the plurality of part tasks using a weight for each of the plurality of part tasks, and a step of an execution unit determining the action of the agent corresponding to the overall task using a policy obtained from the overall value function and causing the agent to act.

A program according to a third invention is a program for causing a computer to function as the respective components of the agent coupling device according to the first invention.

Effects of the Invention

The agent coupling device, the method and the program of the present invention can achieve an effect of constructing an agent that can deal with even a complicated task.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram illustrating a configuration example of a new network using DQN.

FIG. 2 is a block diagram illustrating a configuration of an agent coupling device according to an embodiment of the present invention.

FIG. 3 is a block diagram illustrating a configuration of an agent coupling unit.

FIG. 4 is a flowchart illustrating an agent processing routine in the agent coupling device according to the embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS

In view of the above problems, an embodiment of the present invention proposes a technique of constructing an overall task represented by a weighting sum using a weighting sum of part agents. Examples of the overall task represented by a combination of weights include shooting games and signal control shown below. In a shooting game, it is assumed that a learning result A of solving a part task A of shooting down an enemy A and a learning result B of solving a part task B of shooting down an enemy B have already been obtained. At this time, for example, a task whereby 50 points are gained when the enemy A is shot down and 10 points are gained when the enemy B is shot down is expressed as a weighting sum of the part task A and the part task B. Similarly, in signal control, it is assumed that a learning result A of solving a part task A whereby general vehicles are caused to pass with a short waiting time and a learning result B of solving a part task B whereby public vehicles such as buses are caused to pass with a short waiting time have already been obtained. At this time, for example, a task of minimizing [waiting time of general vehicles + waiting time of public vehicles × 5] is expressed by the above weighting sum of the part task A and the part task B. In the embodiment of the present invention, a learning result can be constructed also for a task represented by such a weighting sum, and a learning result for solving a complicated task can be obtained without relearning, only by combining part agents for a new task, or a learning result can be obtained in a shorter time than relearning from zero.

A technique of reinforcement learning, which is a premise, will be described before describing details of the embodiment of the present invention.

[Reinforcement Learning]

Reinforcement learning is a technique of finding an optimum policy with a setting defined as a Markov Decision Process (MDP) (Reference Literature 1).

[Reference Literature 1]

Reinforcement learning: An introduction, Richard S. Sutton and Andrew G. Barto, MIT Press, Cambridge, 1998.

Simply stated, the MDP describes interaction between an action subject (e.g., robot) and the outside world. The MDP is defined by five sets (S, A, P, R, γ): a set of states S = {s₁, s₂, . . . , s_(S)} that the robot can take, a set of actions A = {a₁, a₂, . . . , a_(A)} that the robot can take, a transition function P = {p^(a)_(ss′)}_(s,s′,a) (where Σ_(s′) p^(a)_(ss′) = 1) that defines how the state transitions when the robot takes an action in a certain state, a reward function R = {r₁, r₂, . . . , r_(S)} that gives information on how good an action taken by the robot in a certain state is, and a discount rate γ (where 0 ≤ γ < 1) that controls the degree of consideration for rewards to be received in the future.

In this setting of the MDP, the robot is given a degree of freedom regarding what action is to be executed in each state. A function that defines the probability that an action a will be executed when the robot is in each state s is called a "policy," and is written as π. The policy π for the action a when the state s is given is expressed as π(a|s) (where Σ_(a) π(a|s) = 1). Reinforcement learning obtains an optimum policy π*_(std), which is a policy that maximizes the expected discount sum of rewards to be obtained from the present into the future, from among a plurality of policies.
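For concreteness, the following is a minimal NumPy sketch of how the five sets (S, A, P, R, γ) and a policy π can be held as arrays; the sizes and values are illustrative assumptions and are not part of the embodiment.

```python
# A minimal sketch of the MDP components (S, A, P, R, gamma) and a policy pi
# as NumPy arrays. Sizes and values are illustrative assumptions.
import numpy as np

n_states, n_actions = 4, 2           # |S| and |A| (hypothetical sizes)
rng = np.random.default_rng(0)

# Transition function P: P[a, s, s2] = p^a_{s s'}; each row over s' sums to 1.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)

# Reward function R = {r_1, ..., r_S}: one reward value per state.
R = rng.random(n_states)

gamma = 0.9                          # discount rate, 0 <= gamma < 1

# Policy pi: pi[s, a] = pi(a | s); each row over a sums to 1.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
assert np.allclose(pi.sum(axis=1), 1.0)
```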

$\pi_{std}^{*} = {\arg{\max\limits_{\pi}{\lim\limits_{T\rightarrow\infty}{E^{\pi}\left\lbrack {\sum\limits_{k = 0}^{T}{\gamma^{k}{\mathcal{R}\left( S_{k} \right)}}} \right\rbrack}}}}$

It is a value function Q^(π) that plays an important role in deriving the optimum policy.

$Q^{\pi}\left( s,a \right) = \lim_{T\rightarrow\infty} E^{\pi}\left\lbrack \sum_{k = 0}^{T} \gamma^{k} \mathcal{R}\left( S_{k} \right) \mid S_{0} = s, A_{0} = a \right\rbrack$

The value function Q^(π) represents the expected discount sum of rewards obtained when the action a is executed in the state s and actions continue to be executed according to the policy π indefinitely thereafter. If the policy π is the optimum policy, a value function Q* (optimum value function) under the optimum policy is known to satisfy the following relationship, and this expression is called a "Bellman optimum equation."

$Q^{*}\left( s,a \right) = \mathcal{R}(s) + \gamma \sum_{s^{\prime}} p_{ss^{\prime}}^{a} \max_{a^{\prime}} Q^{*}\left( s^{\prime},a^{\prime} \right)$

Many techniques of reinforcement learning, represented by Q-learning, first estimate this optimum value function using the relationship in the above expression, and then obtain the optimum policy π*_(std) by making the following setting using the estimation result.

$\pi_{std}^{*}\left( a \mid s \right) = \delta\left( a - \arg\max_{a^{\prime}} Q^{*}\left( s,a^{\prime} \right) \right)$

Where δ(·) represents a delta function.
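As an illustration (not taken from the literature cited above), the Bellman optimum equation and the delta-function policy can be realized for the small tabular MDP sketched earlier by value iteration; the iteration count below is an arbitrary assumption.

```python
# A minimal tabular sketch: estimate Q* with the Bellman optimum equation by
# fixed-point iteration, then read off the greedy (delta-function) policy.
# P, R, gamma are the arrays from the previous sketch; n_iter is assumed.
import numpy as np

def value_iteration(P, R, gamma, n_iter=1000):
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)                                  # max_a' Q(s', a')
        # Q(s, a) <- R(s) + gamma * sum_s' p^a_{ss'} max_a' Q(s', a')
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
    return Q

def greedy_policy(Q):
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0      # delta(a - argmax_a' Q*(s, a'))
    return pi

Q_star = value_iteration(P, R, gamma)
pi_star_std = greedy_policy(Q_star)
```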

[Maximum Entropy Reinforcement Learning]

An approach called "maximum entropy reinforcement learning" is proposed on the basis of the above standard reinforcement learning (Non-Patent Literature 3). This approach needs to be used to construct a new policy by coupling learning results.

Unlike standard reinforcement learning, maximum entropy reinforcement learning obtains an optimum policy π*_(me) that maximizes the expected discount sum of rewards obtained from the present into the future plus the entropy of the policy.

$\pi_{me}^{*} = \arg\max_{\pi} \lim_{T\rightarrow\infty} E^{\pi}\left\lbrack \sum_{k = 0}^{T} \gamma^{k}\left\{ \mathcal{R}\left( S_{k} \right) + \alpha \mathcal{H}\left( \pi\left( \cdot \mid S_{k} \right) \right) \right\} \right\rbrack$

Where α is a weight parameter and H(π(·|S_(k))) represents the entropy of the distribution {π(a₁|S_(k)), . . . , π(a_(A)|S_(k))} that defines the selection probability of each action in a state S_(k). Similarly to the previous section, an (optimum) value function Q*_(soft) can be defined in maximum entropy reinforcement learning as shown in following Expression (1).

$\begin{matrix} Q_{soft}^{*}\left( s,a \right) = \lim_{T\rightarrow\infty} E^{\pi}\left\lbrack \sum_{k = 0}^{T} \gamma^{k}\left\{ \mathcal{R}\left( S_{k} \right) + \alpha \mathcal{H}\left( \pi_{me}^{*}\left( \cdot \mid S_{k} \right) \right) \right\} \mid S_{0} = s, A_{0} = a \right\rbrack & (1) \end{matrix}$

The optimum policy is given using this value function by following Expression (2).

$\begin{matrix} \pi_{me}^{*}\left( a \mid s \right) = \exp\left( \frac{1}{\alpha}\left\{ Q_{soft}^{*}\left( s,a \right) - V_{soft}^{*}(s) \right\} \right) & (2) \end{matrix}$

Where, V*_(soft) is given as follows.

$V_{soft}^{*}(s) = \alpha \log \sum_{a^{\prime}} \exp\left( \frac{1}{\alpha} Q_{soft}^{*}\left( s,a^{\prime} \right) \right)$

In this way, the optimum policy is expressed as a stochastic policy in maximum entropy reinforcement learning. Note that the value function can be estimated in maximum entropy reinforcement learning using the following Bellman equation, as in the case of normal reinforcement learning.

${Q_{soft}^{*}\left( {s,a} \right)} = {{\mathcal{R}(s)} + {\gamma{\sum\limits_{s^{\prime}}{p_{{ss}^{\prime}}^{a}{V_{soft}^{*}\left( s^{\prime} \right)}}}}}$
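The soft Bellman equation and Expression (2) can be sketched for the same tabular setting as follows; the entropy weight α and the iteration count are assumed values, and SciPy's logsumexp is used only for numerical convenience.

```python
# A minimal tabular sketch of maximum entropy (soft) value iteration and of
# Expression (2). P, R, gamma are as in the previous sketches; alpha and
# n_iter are assumed values.
import numpy as np
from scipy.special import logsumexp

alpha = 1.0

def soft_value_iteration(P, R, gamma, alpha, n_iter=1000):
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        # V*_soft(s) = alpha * log sum_a' exp(Q*_soft(s, a') / alpha)
        V = alpha * logsumexp(Q / alpha, axis=1)
        # Q*_soft(s, a) <- R(s) + gamma * sum_s' p^a_{ss'} V*_soft(s')
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
    return Q

def soft_policy(Q, alpha):
    # Expression (2): pi*_me(a|s) = exp((Q*_soft(s, a) - V*_soft(s)) / alpha)
    V = alpha * logsumexp(Q / alpha, axis=1, keepdims=True)
    return np.exp((Q - V) / alpha)

Q_soft = soft_value_iteration(P, R, gamma, alpha)
pi_me = soft_policy(Q_soft, alpha)
```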

[Configuration of Policy Using Simple Average (Existing Technique)]

First, a method of coupling learning results using the above existing technique will be described. Consider two MDPs differing only in reward functions: MDP-1 (S, A, P, R₁, γ) and MDP-2 (S, A, P, R₂, γ), and write the optimum value functions of maximum entropy reinforcement learning in Expression (1) for MDP-1 and MDP-2 as the respective part value functions Q₁ and Q₂. The tasks for the respective MDPs have already been learned, and Q₁ and Q₂ are assumed to be known. Using these part value functions, consider constructing a policy of MDP-3 (S, A, P, R₃, γ), which is a target having reward R₃ = (R₁ + R₂)/2 defined by a simple average.

According to the existing technique (Non-Patent Literature 4), the overall value function Q_(Σ) in the above setting is defined as follows.

Q_(Σ) = ½(Q₁ + Q₂)

Assuming the overall value function Q_(Σ) to be the optimum value function Q₃ of MDP-3 and substituting it into Expression (2), the coupled policy π_(Σ) is obtained. As a matter of course, since Q_(Σ) generally does not coincide with the optimum value function Q₃ of MDP-3, the policy π_(Σ) created using the above coupling method does not coincide with the optimum policy π*₃ of MDP-3. However, it has been proven that an expression holds between the value function Q^(πΣ) obtained when acting according to π_(Σ) and Q₃ (Non-Patent Literature 4), so there is clearly a relationship between the two values, although it cannot be said to be a good approximation. Thus, the existing technique uses π_(Σ) as an initial policy when performing learning on MDP-3, and thereby experimentally shows that learning can be achieved with a smaller number of learning steps than relearning from zero. In this way, the value function Q_(Σ) is used to obtain a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks.

However, the existing technique only considers a case where a task represented by a simple average is constructed using a simple average of part agents, and the number of applicable scenes is limited.

Principles according to Embodiment of Present Invention

Hereinafter, a method of constructing policies used in the embodiment of the present invention will be described.

[Configuration of Weighting Sum Policy]

As with the existing research, there are two MDPs differing only in reward functions: MDP-1: (S, A, P, R₁, γ) and MDP-2: (S, A, P, R₂, γ); the part value functions of maximum entropy reinforcement learning in these MDPs have already been learned, and Q₁ and Q₂ are assumed to be known.

With this setting, the embodiment of the present invention considers constructing a policy of MDP-3: (S, A, P, R₃, γ), which is a target having reward R₃ = β₁R₁ + β₂R₂ defined by a weighting sum. β₁ and β₂ are known weight parameters.

The method proposed in the embodiment of the present invention is defined by following Expression (3).

Q_(Σ) = β₁Q₁ + β₂Q₂   (3)

Assuming Q_(Σ) to be the optimum value function Q₃ of MDP-3, Q_(Σ) is substituted into Expression (2) to obtain the coupled policy π_(Σ). Q_(Σ) generally does not coincide with the optimum value function Q₃ of MDP-3, and the policy π_(Σ) created using the above coupling method does not coincide with the optimum policy π*₃ of MDP-3. As described above, however, an expression holds between the value function Q^(πΣ) obtained when acting according to π_(Σ) and Q₃. Thus, π_(Σ) is used as a policy to solve the task corresponding to MDP-3. By using π_(Σ) as an initial policy when performing learning on MDP-3, learning can be achieved with a smaller number of learning steps than relearning from zero.
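Continuing the tabular sketches above, the coupling of Expression (3) reduces to a weighted sum of the two part value functions followed by policy extraction via Expression (2); the part-task rewards R1, R2 and the weights β₁, β₂ below are illustrative assumptions.

```python
# A minimal sketch of Expression (3): couple two learned part value functions
# with known weights and extract the coupled policy pi_Sigma via Expression (2).
# R1, R2, beta1, beta2 are illustrative assumptions; soft_value_iteration and
# soft_policy are the functions from the previous sketch.
R1, R2 = rng.random(n_states), rng.random(n_states)   # part-task rewards (assumed)
beta1, beta2 = 50.0, 10.0                              # known weights (cf. the shooting-game example)

Q1 = soft_value_iteration(P, R1, gamma, alpha)         # part value function for MDP-1
Q2 = soft_value_iteration(P, R2, gamma, alpha)         # part value function for MDP-2

Q_sigma = beta1 * Q1 + beta2 * Q2                      # Expression (3)
pi_sigma = soft_policy(Q_sigma, alpha)                 # coupled policy used for MDP-3
```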

[When Performing Relearning]

As a specific example of performing relearning, a case will be shown where neural networks (hereinafter also described as "networks") that approximate the part value functions Q₁ and Q₂ have already been learned using a Deep Q-Network (DQN) (Non-Patent Literature 1), and these networks are combined to create an initial value for relearning.

Mainly the following two methods can be considered. One is a method that uses simple coupling of the networks as they are. A new network is created in which a layer is added above the output layers of the network that returns the value of the learned Q₁ and the network that returns the value of Q₂, assigning weights to their values as shown in Expression (3) and outputting the weighted sum. Relearning is performed using this network as the initial value of the function that returns the value function. FIG. 1 illustrates a configuration example of the new network using DQN.
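A minimal PyTorch sketch of this first method is shown below, under the assumption that each part network maps a state to a vector of Q-values; the layer sizes and weights are illustrative, and the sketch is not a reproduction of FIG. 1.

```python
# A minimal sketch of the first method: keep the learned networks for Q1 and Q2
# as they are and add a layer above their output layers that returns
# beta1*Q1(s, .) + beta2*Q2(s, .) as in Expression (3). Shapes and weights are
# illustrative assumptions.
import torch
import torch.nn as nn

class CoupledQNetwork(nn.Module):
    def __init__(self, q1_net, q2_net, beta1, beta2):
        super().__init__()
        self.q1_net, self.q2_net = q1_net, q2_net   # learned part networks
        self.beta1, self.beta2 = beta1, beta2       # weights for the part tasks

    def forward(self, state):
        # Added output layer: weighted sum of the two part networks' Q outputs.
        return self.beta1 * self.q1_net(state) + self.beta2 * self.q2_net(state)

# Hypothetical learned part networks (state_dim -> |A| Q-values each).
state_dim, n_actions = 8, 4
q1_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q2_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_sigma_net = CoupledQNetwork(q1_net, q2_net, beta1=0.7, beta2=0.3)
```

The coupled network can then serve as the initial value for relearning, as described above.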

The other uses a technique called "distillation" (Non-Patent Literature 5). According to this technique, in a situation in which a network called a "Teacher Network" that produces a learning result is given, a Student Network, which uses a number of network layers, an activation function or the like different from those of the Teacher Network, is learned so as to have an input/output relationship similar to that of the Teacher Network. By creating a Student Network using the network created by simple coupling in the first method as the Teacher Network, it is possible to create a network to be used as an initial value.
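A sketch of such distillation under assumed settings (a smaller Student Network, a mean-squared-error loss on Q-values, and randomly drawn states) is shown below; the actual architecture, loss and data are design choices left open by the description.

```python
# A minimal sketch of the second method (distillation): train a smaller Student
# Network so that its input/output relationship matches the coupled Teacher
# Network from the previous sketch. Architecture, loss and data are assumptions.
teacher = q_sigma_net                      # coupled network used as the Teacher Network
student = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    states = torch.randn(64, state_dim)    # states used for distillation (assumed)
    with torch.no_grad():
        target_q = teacher(states)         # Teacher's Q_Sigma values
    loss = loss_fn(student(states), target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The student network is then used as the initial value for relearning.
```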

When the first approach is used, the newly created network includes a number of parameters corresponding to the sum of the parameters of the networks for Q₁ and Q₂, which may become a problem when the number of parameters is large. However, the new network can be created simply. On the contrary, the second approach needs to learn the Student Network, and so creating a new network may take much time, but it is possible to create a new network with fewer parameters.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Configuration of Agent Coupling Device According to Embodiment of Present Invention

Next, a configuration of an agent coupling device according to an embodiment of the present invention will be described. As shown in FIG. 2, an agent coupling device 100 according to the embodiment of the present invention can be constructed of a computer including a CPU, a RAM and a ROM that stores a program for executing an agent processing routine, which will be described later, and various types of data. The agent coupling device 100 is functionally provided with an agent coupling unit 30, an execution unit 32 and a relearning unit 34 as shown in FIG. 2.

The execution unit 32 is configured by including a policy acquisition unit 40, an action determination unit 42, an operation unit 44 and a function output unit 46.

As shown in FIG. 3, the agent coupling unit 30 is configured by including a weight parameter processing unit 310, a part agent processing unit 320, a coupling agent creation unit 330, a coupling agent processing unit 340, a weight parameter recording unit 351, a part agent recording unit 352 and a coupling agent recording unit 353. In the embodiment of the present invention, it is assumed that the part value functions Q₁ and Q₂ of the part tasks and the overall value function Q_(Σ) are configured as neural networks learned in advance so as to approximate the value functions using the above technique such as DQN. Note that a linear sum or the like may be used when they can be simply expressed.

Through the processes of the following respective processing units, the agent coupling unit 30 obtains, as a neural network that approximates the overall value function Q_(Σ), a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks to the neural networks learned in advance so as to approximate the part value functions (Q₁, Q₂) for each of the plurality of part tasks.

The weight parameter processing unit 310 stores predetermined weight parameters β₁ and β₂ used when coupling part tasks in the weight parameter recording unit 351.

The part agent processing unit 320 stores information relating to the part value functions of the part tasks (the part value functions Q₁ and Q₂ themselves or network parameters that approximate them using DQN or the like) in the part agent recording unit 352.

The coupling agent creation unit 330 receives the weight parameters β₁ and β₂ of the weight parameter recording unit 351 and Q₁ and Q₂ of the part agent recording unit 352 as input, and stores information relating to the overall value function Q_(Σ) = β₁Q₁ + β₂Q₂, which is the weighted coupling result (Q_(Σ) itself, neural network parameters that approximate Q_(Σ), or the like), in the coupling agent recording unit 353.

The coupling agent processing unit 340 outputs the network parameters corresponding to the overall value function Q_(Σ) of the coupling agent recording unit 353 to the execution unit 32.

The execution unit 32 determines an action of the agent on the overall task using a policy obtained from the network corresponding to the overall value function Q_(Σ) through each processing unit, which will be described below, and causes the agent to act.

The policy acquisition unit 40 replaces Q*_(soft) in above Expression (2) with the network corresponding to the overall value function Q_(Σ), based on the network corresponding to the overall value function Q_(Σ) output from the agent coupling unit 30, and acquires a policy π_(Σ).
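As a sketch under the same assumptions as the PyTorch examples above, replacing Q*_(soft) in Expression (2) with the output of the coupled network amounts to the following small function; the entropy weight α is an assumed hyperparameter.

```python
# A minimal sketch of the policy acquisition unit: Expression (2) with Q*_soft
# replaced by the coupled network's output, giving pi_Sigma(. | s).
import torch

def soft_policy_from_net(q_net, state, alpha):
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0)        # Q_Sigma(s, .)
        v = alpha * torch.logsumexp(q / alpha, dim=0)   # V_Sigma(s)
        return torch.exp((q - v) / alpha)               # pi_Sigma(a | s)
```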

The action determination unit 42 determines an action of the agent corresponding to the overall task based on the policy acquired by the policy acquisition unit 40.

The operation unit 44 controls the agent so as to perform the determined action.

The function output unit 46 acquires a state S_(k) based on the action result of the agent and outputs the state S_(k) to the relearning unit 34. Note that after a certain number of actions, the function output unit 46 acquires the action result of the agent and the relearning unit 34 relearns the neural network that approximates the overall value function Q_(Σ).

The relearning unit 34 relearns the neural network that approximates the overall value function Q_(Σ) so that the value of the reward function R₃ = β₁R₁ + β₂R₂ increases, based on the state S_(k) based on the action result of the agent by the execution unit 32.
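One plausible form of such a relearning update, sketched with the soft Bellman equation and the coupled reward R₃ = β₁R₁ + β₂R₂, is shown below; the minibatch format, loss and optimizer are assumptions rather than the embodiment's prescribed procedure.

```python
# A minimal sketch of one relearning update on the network approximating
# Q_Sigma, using the soft Bellman equation with reward R3 = beta1*R1 + beta2*R2.
# The minibatch format and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

def relearn_step(q_net, optimizer, batch, beta1, beta2, gamma, alpha):
    states, actions, r1, r2, next_states = batch         # transitions from the agent's actions
    r3 = beta1 * r1 + beta2 * r2                          # coupled reward R3
    with torch.no_grad():
        next_q = q_net(next_states)
        # V_soft(s') = alpha * log sum_a' exp(Q(s', a') / alpha)
        next_v = alpha * torch.logsumexp(next_q / alpha, dim=1)
        target = r3 + gamma * next_v
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)           # move Q_Sigma toward the soft target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```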

The execution unit 32 repeats the processes of the policy acquisition unit 40, the action determination unit 42 and the operation unit 44 using the neural network that approximates the relearned overall value function Q_(Σ) until a predetermined condition is satisfied.

Operation of Agent Coupling Device According to Embodiment of Present Invention

Next, operation of the agent coupling device 100 according to the embodiment of the present invention will be described. The agent coupling device 100 executes an agent processing routine shown in FIG. 4.

First, in step S100, the agent coupling unit 30 obtains, for each of a plurality of part tasks, a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks for a neural network learned in advance so as to approximate the part value functions (Q₁, Q₂), as a neural network that approximates the overall value function Q_(Σ).

Next, in step S102, the policy acquisition unit 40 replaces Q*_(soft) in above Expression (2) with the network that approximates the overall value function Q_(Σ) to acquire the policy π_(Σ).

In step S104, the action determination unit 42 determines an action of the agent on the overall task based on the policy acquired by the policy acquisition unit 40.

In step S106, the operation unit 44 controls the agent so as to perform the determined action.

In step S108, the function output unit 46 determines whether or not a predetermined number of actions have been performed, proceeds to step S110 if the predetermined number of actions have been performed, or returns to step S102 and repeats the process if the predetermined number of actions have not been performed.

In step S110, the function output unit 46 determines whether or not a predetermined condition has been satisfied, ends the process if the predetermined condition has been satisfied, or proceeds to step S112 if the predetermined condition has not been satisfied.

In step S112, the function output unit 46 acquires a state S_(k) based on the action result of the agent and outputs the state S_(k) to the relearning unit 34.

In step S114, the relearning unit 34 relearns the neural network that approximates the overall value function Q_(Σ) so that the value of the reward function R₃ = β₁R₁ + β₂R₂ increases, based on the state S_(k) based on the action result of the agent by the execution unit 32, and returns to step S102.

As described above, the agent coupling device according to the embodiment of the present invention makes it possible to deal with various tasks.

Note that the present invention is not limited to the aforementioned embodiments, but various modifications or applications can be made without departing from the spirit and scope of the invention.

For example, although a case has been described in the aforementioned embodiments where parameters of a neural network created by simply coupling the neural networks that approximate the part value functions Q₁ and Q₂ are learned in relearning, the present invention is not limited to this. When the distillation technique is used, the coupling agent processing unit 340 first simply couples the neural networks that approximate the part value functions Q₁ and Q₂ to create a neural network that approximates the overall value function, learns parameters of a neural network having a predetermined structure so as to correspond to the neural network that approximates the overall value function, and designates the parameters as initial values of the parameters of the neural network having the predetermined structure. The execution unit 32 determines the action of the agent corresponding to the overall task using the policy obtained from the neural network having the predetermined structure and causes the agent to act. The relearning unit 34 relearns the parameters of the neural network having the predetermined structure based on the action result of the agent by the execution unit 32. Determination and execution of an action of the agent by the execution unit 32 and relearning by the relearning unit 34 may be repeated.

Without relearning by the relearning unit 34, the action of the agent may be controlled only by the agent coupling unit 30 and the execution unit 32. In this case, the coupling agent processing unit 340 may output the overall value function Q_(Σ) of the coupling agent recording unit 353 to the execution unit 32, and the execution unit 32 may determine the action of the agent on the overall task using the policy obtained from the overall value function Q_(Σ) and cause the agent to act. More specifically, the policy acquisition unit 40 may replace Q*_(soft) in above Expression (2) with Q_(Σ) based on the overall value function Q_(Σ) output from the agent coupling unit 30 and acquire the policy π_(Σ).

REFERENCE SIGNS LIST

30 Agent coupling unit

32 Execution unit

34 Relearning unit

40 Policy acquisition unit

42 Action determination unit

44 Operation unit

46 Function output unit

100 Agent coupling device

310 Weight parameter processing unit

320 Part agent processing unit

330 Coupling agent creation unit

340 Coupling agent processing unit

351 Weight parameter recording unit

352 Part agent recording unit

353 Coupling agent recording unit

1. An agent coupling device comprising: an agent coupling unit that obtains an overall value function with respect to a value function for obtaining a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks, the overall value function being a weighting sum of a plurality of part value functions learned in advance to obtain a policy for an action of a part agent that solves the part tasks for each of the plurality of part tasks using a weight for each of the plurality of part tasks; and an execution unit that determines the action of the agent corresponding to the overall task using the policy obtained from the overall value function and causes the agent to act.
2. The agent coupling device according to claim 1, wherein the agent coupling unit obtains, as a neural network that approximates the overall value function, a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks for a neural network learned in advance so as to approximate the part value function for each of the plurality of part tasks, and the execution unit determines an action of an agent for the overall task using a policy obtained from the neural network that approximates the overall value function and causes the agent to act.
3. The agent coupling device according to claim 2, further comprising a relearning unit that relearns a neural network that approximates the overall value function based on an action result of the agent by the execution unit.
4. The agent coupling device according to claim 1, wherein the agent coupling unit obtains, for each of the plurality of part tasks, a neural network constructed by adding a layer to be output with a weight assigned to each of the plurality of part tasks for a neural network learned in advance so as to approximate the part value function, as a neural network that approximates the overall value function and creates a neural network having a predetermined structure corresponding to the neural network that approximates the overall value function, and the execution unit determines the action of the agent for the overall task using a policy obtained from the neural network having the predetermined structure and causes the agent to act.

5. The agent coupling device according to claim 4, further comprising a relearning unit that relearns the neural network having the predetermined structure based on the action result of the agent by the execution unit.
6. An agent coupling method comprising: a step of obtaining an overall value function with respect to a value function for obtaining a policy for an action of an agent that solves an overall task represented by a weighting sum of a plurality of part tasks, the overall value function being a weighting sum of a plurality of part value functions learned in advance to obtain a policy for an action of a part agent that solves the part tasks for each of the plurality of part tasks using a weight for each of the plurality of part tasks; and a step of an execution unit determining the action of the agent corresponding to the overall task using a policy obtained from the overall value function and causing the agent to act.
7. A program for causing a computer to function as the respective components of the agent coupling device according to claim 1.